Inference for Approximating Regression Models
Authors
Abstract
The assumptions underlying the Ordinary Least Squares (OLS) model are regularly and sometimes severely violated. In consequence, inferential procedures presumed valid for OLS are invalidated in practice. We describe a framework that is robust to model violations, and describe the modifications to the classical inferential procedures necessary to preserve inferential validity. As the covariates are assumed to be stochastically generated ("Random-X"), the sought-after coverage criterion becomes marginal rather than conditional. We focus on slopes, mean responses, and individual future observations. For slopes and mean responses, the targets of inference are redefined by means of least squares regression at the population level. The partial slopes defined by that regression, rather than the slopes of an assumed linear model, become the population quantities of interest, and they can be estimated unbiasedly. Under this framework, we estimate the Average Treatment Effect (ATE) in Randomized Controlled Studies (RCTs), and derive an estimator more efficient than one commonly used. We express the ATE as a slope coefficient in a population regression and immediately prove unbiasedness that way. For the mean response, the target is the conditional value of the best least squares approximation to the response surface in the population, rather than the conditional value of y. Calibration through the pairs bootstrap can markedly improve such coverage. Moving to observations, we show that when attempting to cover future individual responses, a simple in-sample calibration technique that widens the empirical interval to contain $(1-\alpha)\cdot 100\%$ of the sample residuals is asymptotically valid, even in the face of gross model violations. OLS is startlingly robust to model departures when a future y needs to be covered, but nonlinearity, combined with a skewed X-distribution, can severely undermine coverage of the mean response.
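The in-sample calibration for future observations can be illustrated with a minimal sketch. It is not the dissertation's own procedure: it assumes a symmetric band around the OLS fit whose half-width is the smallest value covering at least $(1-\alpha)\cdot 100\%$ of the absolute sample residuals. The function names `fit_calibrated_ols` and `predict_interval` are hypothetical, and the data are simulated purely for illustration (a skewed X and a nonlinear truth, so the linear model is deliberately wrong).

```python
import numpy as np

def fit_calibrated_ols(X, y, alpha=0.05):
    """Fit OLS and return (beta, half_width).

    half_width is chosen so that fitted value +/- half_width contains
    at least (1 - alpha) of the sample residuals (symmetric band assumed).
    """
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float).reshape(len(y), -1)
    design = np.column_stack([np.ones(len(y)), X])       # intercept + covariates
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)    # least squares coefficients
    abs_resid = np.sort(np.abs(y - design @ beta))
    k = min(len(y), int(np.ceil((1 - alpha) * len(y))))  # number of residuals to cover
    return beta, abs_resid[k - 1]

def predict_interval(beta, half_width, X_new):
    """Return (lower, upper) interval endpoints for each row of X_new."""
    X_new = np.asarray(X_new, dtype=float).reshape(-1, len(beta) - 1)
    center = beta[0] + X_new @ beta[1:]
    return center - half_width, center + half_width

# Simulated data (an assumption for illustration): skewed X, nonlinear response.
rng = np.random.default_rng(0)
x_train = rng.exponential(size=500)
y_train = x_train**2 + rng.normal(size=500)
beta, half_width = fit_calibrated_ols(x_train, y_train, alpha=0.05)

# Empirical coverage of fresh responses drawn from the same mechanism.
x_future = rng.exponential(size=5000)
y_future = x_future**2 + rng.normal(size=5000)
lower, upper = predict_interval(beta, half_width, x_future)
future_coverage = np.mean((y_future >= lower) & (y_future <= upper))
```

Here `future_coverage` is a marginal quantity, averaged over the random X, which matches the Random-X coverage criterion described above. The sketch makes no claim about coverage conditional on a particular X value.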
Our ATE estimator dominates the common estimator, and the higher the R-squared of the regression of a patient's response on covariates, treatment indicator, and interactions, the better our estimator's relative performance. By considering a regression model as a semiparametric approximation to a stochastic mechanism, and not as its description, we rest assured that a coverage guarantee is a coverage guarantee.
Degree Type: Dissertation
Degree Name: Doctor of Philosophy (PhD)
Graduate Group: Statistics
First Advisor: Lawrence D. Brown
Related resources
Partially Improper Gaussian Priors for Nonparametric Logistic Regression
A "partially improper" Gaussian prior is considered for Bayesian inference in logistic regression. This includes generalized smoothing spline priors that are used for nonparametric inference about the logit, and also priors that correspond to generalized random effect models. Necessary and sufficient conditions are given for the posterior to be a proper probability measure, and bounds are given fo...
Analysis of the Posterior for
A "partially improper" Gaussian prior is considered for Bayesian inference in logistic regression. This includes generalized smoothing spline priors that are used for nonparametric inference about the logit, and also priors that correspond to generalized linear mixed models. Necessary and sufficient conditions are given for the posterior to be a proper probability measure, and bounds are given fo...
Bayesian Inference for Spatial Beta Generalized Linear Mixed Models
In some applications, the response variable assumes values in the unit interval. The standard linear regression model is not appropriate for modelling this type of data because the normality assumption is not met. Alternatively, the beta regression model has been introduced to analyze such observations. A beta distribution represents a flexible density family on the (0, 1) interval that covers symm...
Artificial intelligence-based approaches for multi-station modelling of dissolve oxygen in river
In this study, an adaptive neuro-fuzzy inference system and a feed forward neural network, two artificial intelligence-based models, were used along with a conventional multiple linear regression model for multi-station modelling of dissolved oxygen concentration downstream of Mathura City in India. The data used are dissolved oxygen, pH, biological oxygen demand and water...
Prediction of soil cation exchange capacity using support vector regression optimized by genetic algorithm and adaptive network-based fuzzy inference system
Soil cation exchange capacity (CEC) is a parameter that represents soil fertility. Because CEC is difficult to measure, pedotransfer functions (PTFs) can be routinely applied to predict it from soil physicochemical properties that are easily measured. This study developed the support vector regression (SVR) combined with genetic algorithm (GA) together with the adaptive network-based fuzzy infe...
Uncertainty Quality | Uncertainty in Deep Learning
In this chapter we assess the techniques developed in the previous chapters, concentrating on questions such as what our model uncertainty looks like. We experiment with different model architectures and approximating distributions, and use various regression and classification settings. Assessing the models’ confidence quantitatively we can see how much we sacrifice in our attempt at deriving ...